Group Members: Yingying Qian, Xiao Xiao, Qingnan Wang, Hanchen Liu, Runfeng Zhang

Environment: Google Colab


Link: https://colab.research.google.com/drive/1gfdWe4DIwi8EOkcUI5ET_vemQLWRofCF

Problem Statement

Suppose we are the data science (DS) team at Warner Brothers (WB) and have been asked to solve the following tasks.

Task 1

WB has released several movies recently. However, movie review websites do not yet carry enough reviews for us to judge how audiences are reacting to them. Many reviews have been posted on social media (Twitter, Facebook, etc.), and we have already scraped them. We want the DS team to build models that predict audience sentiment from these reviews.

Task 2

Topic models can help us analyze which movies and themes our audience is discussing. Topic modeling surfaces the hot topics around our movies, which is useful for our marketing strategy. By analyzing these topics, we can make better decisions about future movie themes. We can also classify reviews by topic, so that users can easily find reviews on the topics they are interested in.

Note: This notebook may not scale to very large datasets (for example, 50 GB). For such workloads, the code should be adapted to a large-scale data processing framework such as Spark.

Import Modules

Load Data

Data Quality & EDA

Preprocess Reviews

Regex
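
The cleaning step could look roughly like the sketch below. It is only an illustration: the function name `clean_review` and the specific patterns (HTML tags, URLs, mentions, hashtags, non-letters) are assumptions about what scraped social media reviews typically need, not the notebook's actual rules.

```python
import re

def clean_review(text: str) -> str:
    """Lowercase a review and strip markup, URLs, mentions, and punctuation."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags such as <br />
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # @mentions and #hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # anything that is not a letter
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_review("Loved it!<br />See https://t.co/abc @WB #movie 10/10"))
```

The order matters here: tags and URLs are removed before the catch-all non-letter pattern, otherwise fragments such as `br` or `https` would survive as tokens.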

Remove Stopwords
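
A minimal sketch of stopword removal follows. The `STOPWORDS` set is a tiny hand-written list used purely for illustration; a real run would more likely use a fuller list from a library such as NLTK or spaCy.

```python
# Illustrative stopword set (an assumption, not the notebook's actual list).
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "that", "of", "and", "to", "in"}

def remove_stopwords(text: str, stopwords=STOPWORDS) -> str:
    """Drop whitespace-separated tokens that appear in the stopword set."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)

print(remove_stopwords("this movie was a waste of time"))
```

For sentiment work it is usually worth keeping negations such as "not" and "no" out of the stopword set, since removing them can flip the meaning of a review.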

Stemming
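
Stemming can be sketched with NLTK's `PorterStemmer`, assuming NLTK is installed. Whether the notebook uses the Porter stemmer or a different one is not stated, so treat the choice as an assumption; the helper name `stem_text` is hypothetical.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text: str) -> str:
    """Reduce each whitespace-separated token to its Porter stem."""
    return " ".join(stemmer.stem(tok) for tok in text.split())

print(stem_text("acting loved"))
```

Stems are not always dictionary words, which is fine for bag-of-words features but makes the tokens harder to read in topic reports.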

Sentiment Prediction

Logistic Regression

Vectorizer

Train-Test Split

Logistic Regression
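
The split, vectorize, and fit steps can be sketched end to end as below. The toy reviews, the labels (1 = positive, 0 = negative), and the hyperparameters are made up for illustration and are not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1 = positive review, 0 = negative review.
texts = [
    "great movie loved it", "wonderful cast great story",
    "amazing film loved the ending", "great fun wonderful soundtrack",
    "boring movie hated it", "terrible plot awful acting",
    "awful film boring ending", "hated the terrible script",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the reviews, keeping the class balance in both parts.
train_txt, test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit the vectorizer on the training text only to avoid leaking test vocabulary.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_txt)
X_test = vectorizer.transform(test_txt)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(list(preds), list(y_test))
```

Fitting the vectorizer after the split, rather than on the full corpus, keeps the test set unseen during feature construction.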

Logistic Regression with spaCy

LSTM & RNN Using Prof Chen's code

Tokenize Text

Import Keras Toolkit

Load in GloVe Vectors
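
GloVe vectors are distributed as plain text, one token per line followed by its space-separated vector components. A minimal parser might look like the sketch below; the helper name `load_glove` and the two-line in-memory sample are illustrative, and a real run would pass an open file handle for the downloaded GloVe file instead.

```python
import io

def load_glove(handle):
    """Parse GloVe text format into a dict mapping token -> list of floats."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Tiny in-memory sample in the same format as a GloVe file.
sample = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
vectors = load_glove(sample)
print(vectors["good"])
```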

Load in Embeddings

Define the Model

Compile Model - LSTM

Helpful Rule of Thumb for Defining # of Parameters in LSTM:

$$ W = 4d \times (n+d) $$

Where $d$ is the number of memory cells, and $n$ is the number of dimensions for a data point.
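
To make the rule of thumb concrete, the sketch below evaluates it for an example layer with $d = 64$ memory cells and $n = 100$ input dimensions (both values are arbitrary choices for illustration). The second function adds one bias per gate unit, which the rule of thumb omits.

```python
def lstm_weight_count(d: int, n: int) -> int:
    """Rule of thumb: four gates, each with d x (n + d) weights, no biases."""
    return 4 * d * (n + d)

def lstm_param_count_with_bias(d: int, n: int) -> int:
    """Same count plus one bias term per unit in each of the four gates."""
    return 4 * d * (n + d + 1)

print(lstm_weight_count(64, 100))           # 41984
print(lstm_param_count_with_bias(64, 100))  # 42240
```

Keras's `LSTM` layer uses biases by default, so the figure reported by `model.summary()` for such a layer should correspond to the second number rather than the first.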

Fit the Model

Evaluate the Model

Compile Model - RNN

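Helpful Rule of Thumb for Defining # of Parameters in a Simple RNN:

$$ W = d \times (n+d) $$

Where $d$ is the number of recurrent units, and $n$ is the number of dimensions for a data point. This assumes the RNN is a plain (vanilla) recurrent layer: it has a single set of weights rather than the four gate-specific sets of an LSTM, so the factor of 4 drops out.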

Fit the Model

Evaluate the Model

Try Some Random Reviews

Sentiment Prediction with Hugging Face

Prediction with Raw Text

Prediction with Summarization

The output would be

Accuracy with raw text: 0.82

Results for All Models Built

Topic Modeling

Vectorize The Corpus

Fit NMF Model

Report Results For Each Topic

Get the Top Documents For Each Topic
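
Given a document-topic weight matrix such as the `W` returned by NMF, the top documents per topic are simply the rows with the largest weight in each column. The sketch below uses a small hand-written matrix and placeholder document names; the helper name `top_documents` is hypothetical.

```python
import numpy as np

def top_documents(W, docs, n_top=2):
    """Return, for each topic (column of W), the n_top highest-weighted documents."""
    W = np.asarray(W)
    result = {}
    for k in range(W.shape[1]):
        order = np.argsort(W[:, k])[::-1][:n_top]  # indices by descending weight
        result[k] = [docs[i] for i in order]
    return result

# Hand-written weights: 4 documents x 2 topics.
W_demo = [[0.9, 0.0], [0.1, 0.8], [0.7, 0.2], [0.0, 0.5]]
docs_demo = ["d0", "d1", "d2", "d3"]
print(top_documents(W_demo, docs_demo))
```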

WordCloud for Each Topic

As the word clouds above show, there are 5 topics in the review dataset.